Prosper Loan Dataset EDA with R by Gaurav Ansal

INTRODUCTION

The dataset investigated is from Prosper, a peer-to-peer lending website where people can invest in or borrow money. It works in the following way: borrowers choose a loan amount and purpose and post a loan listing; investors review loan listings and invest in the ones they are interested in; once the process is complete, borrowers make fixed monthly payments and investors receive a portion of those payments directly in their Prosper account.

My goals for this exploratory data analysis are twofold. The first is to understand some of the variables and visualize their distributions. The second is to find possible correlations among the variables.

The methodology is to explore the dataset through univariate, bivariate and multivariate analysis. The tools I use are R’s visualization package ggplot2 and linear models.

Loading the Prosper Loan dataset

## [1] 113937     81

The full Prosper loan data in the dataframe prosper.full has 113,937 transaction records (observations) with 81 variables (columns).

Next, I changed the names of two variables in the prosper.full dataframe: ListingCategory..numeric. became ListingCategory.numeric, and ProsperRating..numeric. became ProsperRating.numeric.
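The renaming step could be sketched as follows. This is a minimal example on a toy stand-in for prosper.full (the real dataframe has 81 columns); the helper vectors old.names and new.names are names introduced here for illustration only.

```r
# Toy stand-in for prosper.full; only the two columns being renamed are shown.
prosper.full <- data.frame(
  ListingCategory..numeric. = c(0L, 2L, 16L),
  ProsperRating..numeric.   = c(NA, 6L, 3L)
)

# Rename by matching the old names, so the column order does not matter.
old.names <- c("ListingCategory..numeric.", "ProsperRating..numeric.")
new.names <- c("ListingCategory.numeric", "ProsperRating.numeric")
names(prosper.full)[match(old.names, names(prosper.full))] <- new.names

names(prosper.full)
```

Matching on names rather than on column positions keeps the step safe if the columns are ever reordered.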

I explored 20 variables in total. I started by looking at the documentation to find interesting variables, then took 19 of the 81 variables and subsetted them into a new dataframe called prosper.

I also created another variable called CreditScore.avg, which is the average of CreditScoreRangeLower and CreditScoreRangeUpper.
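A minimal sketch of how such an average could be computed, using a toy dataframe with a few illustrative rows (the NA row is an assumption added to show that missing scores propagate):

```r
# Toy stand-in for the prosper subset; values are illustrative.
prosper <- data.frame(
  CreditScoreRangeLower = c(640L, 680L, 480L, NA),
  CreditScoreRangeUpper = c(659L, 699L, 499L, NA)
)

# Midpoint of the reported credit score range; NA stays NA.
prosper$CreditScore.avg <-
  (prosper$CreditScoreRangeLower + prosper$CreditScoreRangeUpper) / 2

prosper$CreditScore.avg
```

Because each range spans 19 points, the midpoint always ends in .5, which is why the summary statistics of CreditScore.avg later in the report end in .5 as well.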

So, now we have 20 variables in total. Below are the names of the variables in the subset dataframe prosper that we will use for exploration.

## [1] 20
##  [1] "BorrowerRate"             "EmploymentStatus"        
##  [3] "EmploymentStatusDuration" "IncomeRange"             
##  [5] "StatedMonthlyIncome"      "IsBorrowerHomeowner"     
##  [7] "LoanStatus"               "ListingCategory.numeric" 
##  [9] "CreditScoreRangeLower"    "CreditScoreRangeUpper"   
## [11] "DebtToIncomeRatio"        "RevolvingCreditBalance"  
## [13] "OpenCreditLines"          "InquiriesLast6Months"    
## [15] "Term"                     "LoanOriginalAmount"      
## [17] "LoanOriginationDate"      "LoanOriginationQuarter"  
## [19] "ProsperRating.numeric"    "CreditScore.avg"

Now let’s look at the structure of the dataset prosper.

## 'data.frame':    113937 obs. of  20 variables:
##  $ BorrowerRate            : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ EmploymentStatus        : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration: int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IncomeRange             : Factor w/ 8 levels "$0 ","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ StatedMonthlyIncome     : num  3083 6125 2083 2875 9583 ...
##  $ IsBorrowerHomeowner     : logi  TRUE FALSE FALSE TRUE TRUE TRUE ...
##  $ LoanStatus              : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ListingCategory.numeric : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ CreditScoreRangeLower   : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper   : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ DebtToIncomeRatio       : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ RevolvingCreditBalance  : int  0 3989 NA 1444 6193 62999 5812 1260 9906 9906 ...
##  $ OpenCreditLines         : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ InquiriesLast6Months    : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ Term                    : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanOriginalAmount      : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate     : Factor w/ 1873 levels "1/10/2006 0:00",..: 1729 880 35 314 1783 541 974 1099 476 476 ...
##  $ LoanOriginationQuarter  : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ ProsperRating.numeric   : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ CreditScore.avg         : num  650 690 490 810 690 ...

Next, I converted LoanOriginationDate from a factor of character strings to Date type.
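One way this conversion could be done is sketched below. It assumes the month/day/year layout seen in the factor level "1/10/2006 0:00" above; the other two date strings are made up for illustration.

```r
# Toy factor in the same "m/d/yyyy h:mm" layout as LoanOriginationDate.
dates.raw <- factor(c("1/10/2006 0:00", "3/3/2014 0:00", "11/1/2012 0:00"))

# Convert via character; as.Date() ignores the trailing time-of-day part
# once the "%m/%d/%Y" portion has been parsed.
dates <- as.Date(as.character(dates.raw), format = "%m/%d/%Y")

class(dates)
range(dates)
```

Going through as.character() matters: calling as.numeric() on a factor would return the internal level codes rather than the dates.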

Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

From the above histogram with a binwidth of 0.001, BorrowerRate counts are spread across the whole range with alternating high and low counts, except near the tail, where a rate of around 0.32 has a high count of about 3,800 and a rate of around 0.35 has a high count of about 2,000.

The above bar graph shows that Employed and Full-time borrowers have the highest counts. In contrast, Not employed, Part-time and Retired borrowers have low counts. This suggests that people with a secure income are more likely to be eligible than people without one, which would explain why they make up most of the dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.07  137.00  755.00    7625

The distribution of the variable EmploymentStatusDuration, with a binwidth of 12 months (1 year), is positively skewed: the count gradually decreases as the duration increases. People with a longer employment status duration have more stability and are thus more eligible for a loan.

Notice how the $100,000+ level is placed in between the other levels.

## [1] "$0 "            "$1-24,999"      "$100,000+"      "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "Not displayed"  "Not employed"

The levels of IncomeRange have been reordered and are now in ascending order. See below.

## [1] "$0 "            "$1-24,999"      "$25,000-49,999" "$50,000-74,999"
## [5] "$75,000-99,999" "$100,000+"      "Not displayed"  "Not employed"
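A reordering like this can be sketched by rebuilding the factor with an explicit levels argument. The example below uses a small toy factor; the level strings are copied from the output above (including the trailing space in "$0 ").

```r
# Toy factor; factor() sorts levels alphabetically by default.
income <- factor(c("$25,000-49,999", "$100,000+", "$0 ", "Not employed"))
levels(income)

# Desired order, from lowest to highest income, then the non-numeric levels.
income.levels <- c("$0 ", "$1-24,999", "$25,000-49,999", "$50,000-74,999",
                   "$75,000-99,999", "$100,000+", "Not displayed", "Not employed")

# Re-create the factor with explicit levels; the data values are unchanged.
income <- factor(income, levels = income.levels)
levels(income)
```

Only the level order changes, so bar charts built from the factor will show the income ranges in ascending order.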

The above bar graph shows that the counts across the IncomeRange levels are roughly bell-shaped. Most loans go to people in the middle income ranges, while people in the low and high income ranges have comparatively fewer loans. Perhaps people in the low income ranges are not able to qualify for a loan, and people in the high income ranges rarely need one, so they do not apply.

Below is the summary of the StatedMonthlyIncome.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

There is a huge difference between the Max. value and the quartiles of the distribution. Therefore, I removed the top 1% of the values in StatedMonthlyIncome. Below is the summary after removal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    5132   10260   10260   15400   20530
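Trimming the top 1% could be sketched with quantile() and subset(), as below. The toy data (99 ordinary incomes plus one extreme value) and the names cutoff and trimmed are assumptions for illustration, not the code used in this report.

```r
# Toy incomes: 100, 200, ..., 9900 plus one extreme value like the real maximum.
prosper <- data.frame(StatedMonthlyIncome = c(1:99 * 100, 1750000))

# 99th percentile of the income values.
cutoff <- quantile(prosper$StatedMonthlyIncome, probs = 0.99, na.rm = TRUE)

# Keep only rows at or below the cutoff.
trimmed <- subset(prosper, StatedMonthlyIncome <= cutoff)

summary(trimmed$StatedMonthlyIncome)
```

After trimming only the top 1%, the lower quartiles and the median should stay close to their values before trimming, which is a useful sanity check on the filtered summary.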

Now let us adjust the binning using the Freedman-Diaconis rule, nclass.FD(x), which returns a suggested number of bins.

## [1] 142

The Freedman-Diaconis rule suggests 142 bins for StatedMonthlyIncome after removing the top 1% of the data.
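Since nclass.FD() reports a number of classes rather than a width, it can help to compute both. The sketch below uses toy data and the standard Freedman-Diaconis width, 2 * IQR / n^(1/3); the names fd.width and fd.bins are introduced here for illustration.

```r
# Toy income values; the real vector would be the trimmed StatedMonthlyIncome.
x <- seq(0, 9900, by = 100)

# Freedman-Diaconis bin width: 2 * IQR / n^(1/3).
fd.width <- 2 * IQR(x) / length(x)^(1/3)

# Number of bins implied by that width over the data range.
fd.bins <- ceiling(diff(range(x)) / fd.width)

fd.width
fd.bins
nclass.FD(x)  # base R helper; returns the number of bins, not the width
```

In ggplot2, fd.width would be passed as binwidth and fd.bins as bins; passing the bin count as binwidth would give bins far wider or narrower than intended.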

For the above graph, we have eliminated the top 1% of StatedMonthlyIncome. The distribution appears roughly bell-shaped with a positive skew, which is expected for income data.

From the above graph, owning a home doesn’t seem to improve the chances of getting a loan. As can be seen from the graph, the counts of homeowners and non-homeowners are almost the same.

From the above graph, defaulted and delinquent accounts are far fewer than completed and current loan accounts, as expected.

From the above graph, it can be seen that most loans are taken for Debt Consolidation, followed by Home Improvement and Business as the other major reasons.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591

From the above graph, most of the applicants have an average credit score between 630 and 760 points.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

From the above graph, it appears that DebtToIncomeRatio is roughly bell-shaped between 0 and 1 and slightly positively skewed. We can also see some people with a ratio of 10.01, which suggests that a few borrowers are very high risk.

Let us now eliminate the outliers by adjusting the x-axis.

The positive skewness is now clearly visible. The yellow vertical line passes through the mean.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3121    8549   17600   19520 1436000    7604

From the above graph, log10 of RevolvingCreditBalance follows an approximately normal distribution that is slightly negatively skewed, with most people having a revolving credit balance of around 10,000.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00    9.00    9.26   12.00   54.00    7604

From the above graph, OpenCreditLines follows an approximately normal distribution that is slightly positively skewed. On average, people have around 9 open credit lines (mean 9.26, median 9).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   1.000   1.435   2.000 105.000     697

As expected, the distribution of the InquiriesLast6Months is positively skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   40.83   36.00   60.00

From the above graph, most loans have a 3-year term, followed by the 5-year term, and very few have a 1-year term.

From the above graph, it appears that people prefer round loan amounts such as 4000, 10000, 15000, 20000 and 25000, as we can see high spikes at these amounts. There are more listings for lower loan amounts than for higher ones.

From the above graph, it appears that very few loans were originated during 2009. Activity then picked up, dipped again during late 2012 and early 2013, and picked up strongly after that; most of the loans are from 2013 onward.

From the above, it appears that loan activity dips around the end of the year and the start of the new year, i.e. during Q4 and Q1, and picks back up in Q2 and Q3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   3.000   4.000   4.072   5.000   7.000   29084

From the above graph, it appears that a ProsperRating of 4 has the highest count and that the ratings are roughly normally distributed, which makes sense because more people would have an average rating than a very low or very high one.

Univariate Analysis

What is the structure of your dataset?

There are 113,937 observations and, for the scope of this project, I limit the number of variables to 20. Of these 20, 19 are chosen from the dataset and 1 has been created. There are a few factor variables (EmploymentStatus, IncomeRange, LoanStatus, LoanOriginationQuarter), a Date variable (LoanOriginationDate), a Boolean variable (IsBorrowerHomeowner), and the rest are integer and numeric variables.

Looking at how Prosper works, I selected variables that fit the following criteria:

  1. Basic information that a user gives to the site (LoanOriginalAmount, ListingCategory.numeric, IncomeRange, EmploymentStatus, EmploymentStatusDuration, etc.) when they want to register for a loan.

  2. Credit profile information (CreditScoreRangeLower, CreditScoreRangeUpper, DebtToIncomeRatio, RevolvingCreditBalance, etc.) that may aid in generating the ‘Prosper rating’, ‘Borrower rate’ and ‘Term’. This can be seen on the loan listing page on the Prosper site.

  3. Other information: ‘OpenCreditLines’, ‘InquiriesLast6Months’, ‘LoanOriginationDate’, etc.

What is/are the main feature(s) of interest in your dataset?

To me, the main features of interest are BorrowerRate, loan amount (LoanOriginalAmount) and loan term (Term). A borrower cares about the interest rate charged, the loan amount and the loan duration. I want to know which variables have an impact on BorrowerRate.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Other features that may help are IncomeRange or StatedMonthlyIncome, CreditScore.avg, DebtToIncomeRatio, RevolvingCreditBalance, OpenCreditLines, EmploymentStatus, EmploymentStatusDuration and ProsperRating.numeric.

Did you create any new variables from existing variables in the dataset?

Yes, CreditScore.avg, which is the average of the two variables CreditScoreRangeLower and CreditScoreRangeUpper. Instead of carrying two credit scores for the lower and upper ends of the range, a single average credit score is more convenient for analyzing BorrowerRate.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

EmploymentStatusDuration, StatedMonthlyIncome, OpenCreditLines, RevolvingCreditBalance and InquiriesLast6Months are positively skewed.

DebtToIncomeRatio is slightly positively skewed.

log10(RevolvingCreditBalance) is slightly negatively skewed.

IncomeRange, CreditScore.avg and ProsperRating.numeric are approximately normally distributed.

LoanOriginationDate was a factor of character strings and was converted to Date type. After conversion, it was plotted, which helped to analyze the pattern of loan demand through the years.

Bivariate Plots Section

Below is the correlation matrix plot of all the continuous variables.

From the above plot, it appears that there are strong correlations between a few of the variables. There is a strong negative correlation between BorrowerRate and ProsperRating.numeric. Also, there is a strong positive correlation between ProsperRating.numeric and CreditScore.avg.

From the above graph, there does not seem to be a relationship between BorrowerRate and LoanOriginalAmount. For both low and high loan amounts, BorrowerRate can be low or high depending on other factors. So, BorrowerRate appears to be largely independent of the loan amount.

Below is the summary of BorrowerRate for the 12-month Term.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.0929  0.1434  0.1501  0.2064  0.2669

Below is the summary of BorrowerRate for the 36-month Term.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1274  0.1815  0.1935  0.2599  0.4975

Below is the summary of BorrowerRate for the 60-month Term.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0669  0.1490  0.1870  0.1930  0.2319  0.3304

From the above plot and the summary of BorrowerRate for each Term, it appears that BorrowerRate is lower for 12-month loans, with a mean of 0.1501. BorrowerRate for 36-month and 60-month loans is nearly the same on average, with means of 0.1935 and 0.1930 respectively, but the 36-month Term has a larger spread.

Below is the mean loan amount for each Term.

## prosper$Term: 12
## [1] 4694.297
## -------------------------------------------------------- 
## prosper$Term: 36
## [1] 7276.155
## -------------------------------------------------------- 
## prosper$Term: 60
## [1] 12370.4
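Group means like these can be computed with by(), which prints in the layout shown above, or with tapply(), which returns a plain named result that is easier to reuse. The sketch below runs on toy data with illustrative amounts; mean.by.term is a name introduced here.

```r
# Toy stand-in for the prosper dataframe.
prosper <- data.frame(
  Term               = c(12L, 12L, 36L, 36L, 36L, 60L),
  LoanOriginalAmount = c(4000, 5000, 3000, 10000, 8000, 15000)
)

# by() applies a function to each group and prints one block per group.
by(prosper$LoanOriginalAmount, prosper$Term, mean)

# tapply() gives the same means as a named array.
mean.by.term <- tapply(prosper$LoanOriginalAmount, prosper$Term, mean)
mean.by.term
```

The same pattern works for any summary function, for example max or median in place of mean.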

From the above plot and summary, we can see that people usually take smaller loans for shorter periods of time. However, there are a few exceptions, which appear as outliers in each Term. There seems to be some correlation between the two.

From the above graph, it appears that there is a very weak correlation between BorrowerRate and EmploymentStatusDuration.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

In the above graph, StatedMonthlyIncome has been limited to its 99th percentile because it contains a few very high values, as can be seen in the summary above. StatedMonthlyIncome and BorrowerRate do appear to have a weak correlation with each other.

From the above graph, it appears that most of the loans were taken for, in order: Debt Consolidation, Home Improvement, Business, Personal Loan and Other.

From the above graph, it can be said that StatedMonthlyIncome can be equal for different values of EmploymentStatusDuration. As EmploymentStatusDuration increases, the range of StatedMonthlyIncome becomes narrower and the fluctuation between monthly incomes decreases.

From the above graph, it appears that BorrowerRate decreases once CreditScore.avg goes beyond 600. There appears to be some correlation between BorrowerRate and CreditScore.avg. We can also see that a few people were provided loans even though their CreditScore.avg is below 25. It needs to be investigated further which other factors influenced their listings.

Let us now eliminate the outliers by adjusting the axis. Below is the graph.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

From the above graph and summary, Prosper mostly provides loans to people whose DebtToIncomeRatio is no more than 1.25, with a few exceptions that appear as outliers. There are also a number of loans provided to people whose DebtToIncomeRatio is about 10 (the maximum is 10.01). It needs to be investigated further why these people with a high DebtToIncomeRatio were provided loans and which factors influenced that.

## [1]       0 1435667
##      99% 
## 150795.9

In the above graph, RevolvingCreditBalance has been limited to its 99th percentile. It appears that as RevolvingCreditBalance increases, fewer loans were provided. See how the loan count decreases after a RevolvingCreditBalance of 50000, while BorrowerRate keeps fluctuating.

From the above graph, it appears that as DebtToIncomeRatio increases from 0 to 1.25, the range of CreditScore.avg narrows from 450-900 toward a single value of about 700. If we split CreditScore.avg into two groups, 450 to 700 and 700 to 900, the variation in score decreases within each group. For the 450 to 700 group, scores tend to move up toward 700 as DebtToIncomeRatio increases. Similarly, for the 700 to 900 group, scores tend to move down toward 700 as DebtToIncomeRatio increases.

RevolvingCreditBalance behaves similarly to DebtToIncomeRatio with respect to CreditScore.avg. The score range of 525-900 converges to about 700 as RevolvingCreditBalance increases from 0 to 250000.

From the above graph, it appears that DebtToIncomeRatio converges from a range of 0-0.8 to about 0.27 as RevolvingCreditBalance increases. The variation in the ratio decreases as RevolvingCreditBalance increases.

From the above graph, it appears that there is no trend between these two variables up to 20 OpenCreditLines, but beyond 20 the variation in BorrowerRate decreases.

Again, in the above graph, BorrowerRate converges as InquiriesLast6Months increases. The variation in BorrowerRate decreases.

Again, in the above graph, the variation in the CreditScore.avg decreases as the OpenCreditLines increases.

In the above graph, CreditScore.avg decreases as InquiriesLast6Months increases. Also, the variation in CreditScore.avg decreases as InquiriesLast6Months increases.

## 
##  Pearson's product-moment correlation
## 
## data:  prosper$BorrowerRate and prosper$ProsperRating.numeric
## t = -917.37, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9537172 -0.9524846
## sample estimates:
##        cor 
## -0.9531049

We can clearly see from the above graph that BorrowerRate decreases as ProsperRating.numeric increases, and that the relationship between these two variables is linear. Also, the Pearson’s product-moment correlation between BorrowerRate and ProsperRating.numeric is -0.9531049, which is very strong. These two variables are a good fit for a linear regression model.
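A correlation test of this kind can be run with cor.test(). The sketch below uses a handful of made-up rating and rate values (including one missing rating) purely to show the call and how to read its result.

```r
# Toy data: borrower rate falls as the rating rises; one rating is missing.
rating <- c(1, 2, 3, 4, 5, 6, 7, NA)
rate   <- c(0.32, 0.28, 0.24, 0.19, 0.15, 0.11, 0.07, 0.20)

# Pearson's product-moment correlation; incomplete pairs are dropped.
res <- cor.test(rate, rating, method = "pearson")

res$estimate   # the correlation coefficient
res$parameter  # degrees of freedom = number of complete pairs - 2
res$conf.int   # 95 percent confidence interval
```

The degrees of freedom explain the df = 84851 in the output above: only the 84,853 rows with a non-missing ProsperRating.numeric enter the test.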

From the above graph, we see that ProsperRating.numeric increases as CreditScore.avg increases. The change can be seen once CreditScore.avg crosses 725. The variation in ProsperRating.numeric decreases.

From the above graph, the loan amount provided to a borrower increases as the borrower’s ProsperRating.numeric increases. We can see that the relationship between these two variables is negatively skewed.

From the above graph, it appears that most loans have a 36-month Term irrespective of ProsperRating.numeric. The 12-month Term is the least common.

From the above graph, it appears that people with a high Prosper Rating have the highest average StatedMonthlyIncome.

Pearson’s product-moment correlation

Below is the list of Pearson’s product-moment correlations between a few independent variables and BorrowerRate.

1. ProsperRating.numeric

## 
##  Pearson's product-moment correlation
## 
## data:  prosper$BorrowerRate and prosper$ProsperRating.numeric
## t = -917.37, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9537172 -0.9524846
## sample estimates:
##        cor 
## -0.9531049

2. CreditScore.avg

## 
##  Pearson's product-moment correlation
## 
## data:  prosper$BorrowerRate and prosper$CreditScore.avg
## t = -175.17, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4661358 -0.4569730
## sample estimates:
##        cor 
## -0.4615667

3. OpenCreditLines

## 
##  Pearson's product-moment correlation
## 
## data:  prosper$BorrowerRate and prosper$OpenCreditLines
## t = -34.76, df = 106330, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1119376 -0.1000515
## sample estimates:
##        cor 
## -0.1059984

4. StatedMonthlyIncome

## 
##  Pearson's product-moment correlation
## 
## data:  prosper$BorrowerRate and prosper$StatedMonthlyIncome
## t = -30.155, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09473938 -0.08321827
## sample estimates:
##        cor 
## -0.0889818

5. RevolvingCreditBalance

## 
##  Pearson's product-moment correlation
## 
## data:  prosper$BorrowerRate and prosper$RevolvingCreditBalance
## t = -19.472, df = 106330, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06559529 -0.05361688
## sample estimates:
##         cor 
## -0.05960823

6. EmploymentStatusDuration

## 
##  Pearson's product-moment correlation
## 
## data:  prosper$BorrowerRate and prosper$EmploymentStatusDuration
## t = -6.4922, df = 106310, p-value = 8.499e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02591549 -0.01389795
## sample estimates:
##         cor 
## -0.01990744

7. Term

## 
##  Pearson's product-moment correlation
## 
## data:  prosper$BorrowerRate and prosper$Term
## t = 6.781, df = 113940, p-value = 1.199e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.01428050 0.02588888
## sample estimates:
##        cor 
## 0.02008537

8. DebtToIncomeRatio

## 
##  Pearson's product-moment correlation
## 
## data:  prosper$BorrowerRate and prosper$DebtToIncomeRatio
## t = 20.465, df = 105380, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05690080 0.06892819
## sample estimates:
##        cor 
## 0.06291678

9. InquiriesLast6Months

## 
##  Pearson's product-moment correlation
## 
## data:  prosper$BorrowerRate and prosper$InquiriesLast6Months
## t = 62.926, df = 113240, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1781764 0.1894316
## sample estimates:
##     cor 
## 0.18381

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  1. Beyond a particular CreditScore.avg (about 650), BorrowerRate and CreditScore.avg are negatively correlated with each other.

  2. Relationship between the LoanOriginalAmount and ProsperRating.numeric is negatively skewed.

  3. There is a strong negative Correlation between BorrowerRate and ProsperRating.numeric.

Other variables also have relationships with BorrowerRate, loan amount and loan Term, but they are not as strong.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

  1. Beyond a particular CreditScore.avg (about 750), CreditScore.avg and ProsperRating.numeric are positively correlated with each other.

  2. There is positive correlation between ProsperRating.numeric and Average StatedMonthlyIncome.

There are relationships between other features too, but they are not as strong.

What was the strongest relationship you found?

The strongest relationship I found is between BorrowerRate and ProsperRating.numeric, with a Pearson’s product-moment correlation of -0.9531049. The next strongest is between BorrowerRate and CreditScore.avg, with a correlation of -0.4615667.

Multivariate Plots Section

From the above graph, it is observed that people with a high ProsperRating have a lower BorrowerRate than people with a lower ProsperRating within the same range of average CreditScore, i.e. 600 - 900. As ProsperRating goes down, BorrowerRate goes up.

## `geom_smooth()` using method = 'gam'

## prosper$Term: 12
## [1] 0.2669
## -------------------------------------------------------- 
## prosper$Term: 36
## [1] 0.4975
## -------------------------------------------------------- 
## prosper$Term: 60
## [1] 0.3304

From the above graph, it is observed that for the same range of average CreditScore, BorrowerRate is lowest for the 12-month Term, where it stays at or below 0.2669. BorrowerRate for the 36-month and 60-month Terms reaches higher values: the maximum rate is 0.4975 for 36 months and 0.3304 for 60 months.

From the above graph, it is observed that for the same ProsperRating, loan amount and Term increase together. For shorter Terms, smaller loan amounts are provided, whereas for longer Terms the loan amount is larger.

Buckets have been created for BorrowerRate with a bucket size of 10%.

## 
##  0-10% 10-20% 20-30% 30-40% 
##  12865  53036  35709  12313
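Buckets like these can be built with cut(). The sketch below is one possible way to do it on toy rates, not necessarily the code behind the table above; the labels are copied from the table and rate.bucket is a name introduced here.

```r
# Toy borrower rates, including one above 40% and one equal to 0.
rate <- c(0.092, 0.158, 0.275, 0.35, 0.4975, 0)

# Intervals are (0, 0.1], (0.1, 0.2], (0.2, 0.3], (0.3, 0.4] by default.
rate.bucket <- cut(rate,
                   breaks = seq(0, 0.4, by = 0.1),
                   labels = c("0-10%", "10-20%", "20-30%", "30-40%"))

# Values outside the breaks become NA; useNA makes them visible.
table(rate.bucket, useNA = "ifany")
```

With these defaults, rates above 0.4 and rates of exactly 0 fall outside every bucket, so the bucket counts can sum to slightly less than the number of rows. Setting include.lowest = TRUE would keep the zeros.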

From the above graph, it is observed that people with a high Prosper Rating are provided larger loans at a low BorrowerRate, whereas people with a very low Prosper Rating are provided smaller loans at a high BorrowerRate.

From the above graphs, it is observed that BorrowerRate decreases as the loan Term increases for a given ProsperRating.

From the above graph, the average StatedMonthlyIncome for people with a 12-month Term is higher than for people with longer Terms. People with 36-month and 60-month loans have nearly the same average StatedMonthlyIncome.

## 
##   0-100 101-200 201-300 301-400 401-500 501-600 601-700 701-800 801-900 
##     133       0       0       1     528    6103   52831   48891    4859

Buckets have been created for CreditScore.avg with a bucket size of 100, from 0 to 900. From the above graph, it is observed that the relationship between BorrowerRate and ProsperRating.numeric holds in the buckets from 600 to 900. Also, see how well the linear model line fits the data.

Now, let us find out which factors influence the Prosper Rating.

Below, I try to show this using only a few of the variables I selected for the analysis.

Subsetting, Scaling, Grouping, Taking Average and geom_path Plotting the variables

This plot shows the effect of several factors on ProsperRating. The data for each factor is scaled, grouped by Prosper rating and averaged. I used a path plot to show how each factor trends across ProsperRating. Since the data is scaled, each point also shows how far it is from the mean (the 0 line).
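The scale, group and average steps described here could look roughly like the sketch below. It uses base R on toy data with two factors only; the helper names factors, scaled and rating.means are assumptions for illustration, and the plotting step is omitted.

```r
# Toy stand-in for the prosper dataframe.
prosper <- data.frame(
  ProsperRating.numeric = c(1, 1, 4, 4, 7, 7),
  CreditScore.avg       = c(649.5, 669.5, 689.5, 709.5, 769.5, 789.5),
  InquiriesLast6Months  = c(4, 3, 2, 1, 0, 0)
)

# 1. Scale each factor to mean 0 and standard deviation 1.
factors <- c("CreditScore.avg", "InquiriesLast6Months")
scaled <- as.data.frame(scale(prosper[factors]))

# 2. Attach the (unscaled) rating to group by.
scaled$ProsperRating.numeric <- prosper$ProsperRating.numeric

# 3. Average each scaled factor within each rating.
rating.means <- aggregate(. ~ ProsperRating.numeric, data = scaled, FUN = mean)
rating.means
```

Each row of rating.means gives, for one rating, how many standard deviations above or below the overall mean the typical borrower sits on each factor, which is what the points relative to the 0 line represent.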

From the above graph, it can be observed that for a higher Prosper Rating, credit score is the most important factor, followed by stated monthly income, inquiries in the last 6 months, debt-to-income ratio, open credit lines and revolving credit balance, in that order.

  • Prosper Rating goes up as Credit Score and Stated Monthly Income increase.
  • Also, observe how the Rating goes down as Inquiries and Debt to Income Ratio increase.
  • Also, the Rating goes up slightly as Revolving Credit Balance and Open Credit Lines increase.

See how the points for Rating 1 and Rating 7 swap position and direction with respect to the vertical line at 0. Also, for Rating 4, all the points come close to each other within a very small range.

Building a Linear Model

## 
## Calls:
## m1: lm(formula = BorrowerRate ~ ProsperRating.numeric, data = prosper)
## m2: lm(formula = BorrowerRate ~ ProsperRating.numeric + CreditScore.avg, 
##     data = prosper)
## m3: lm(formula = BorrowerRate ~ ProsperRating.numeric + CreditScore.avg + 
##     OpenCreditLines, data = prosper)
## m4: lm(formula = BorrowerRate ~ ProsperRating.numeric + CreditScore.avg + 
##     OpenCreditLines + StatedMonthlyIncome, data = prosper)
## m5: lm(formula = BorrowerRate ~ ProsperRating.numeric + CreditScore.avg + 
##     OpenCreditLines + StatedMonthlyIncome + RevolvingCreditBalance, 
##     data = prosper)
## m6: lm(formula = BorrowerRate ~ ProsperRating.numeric + CreditScore.avg + 
##     OpenCreditLines + StatedMonthlyIncome + RevolvingCreditBalance + 
##     Term, data = prosper)
## m7: lm(formula = BorrowerRate ~ ProsperRating.numeric + CreditScore.avg + 
##     OpenCreditLines + StatedMonthlyIncome + RevolvingCreditBalance + 
##     Term + DebtToIncomeRatio, data = prosper)
## 
## =======================================================================================================
##                              m1         m2         m3         m4         m5         m6         m7      
## -------------------------------------------------------------------------------------------------------
##   (Intercept)              0.369***   0.348***   0.350***   0.350***   0.350***   0.331***   0.333***  
##                           (0.000)    (0.001)    (0.001)    (0.001)    (0.001)    (0.001)    (0.001)    
##   ProsperRating.numeric   -0.043***  -0.043***  -0.043***  -0.043***  -0.043***  -0.043***  -0.043***  
##                           (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    
##   CreditScore.avg                     0.000***   0.000***   0.000***   0.000***   0.000***   0.000***  
##                                      (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    
##   OpenCreditLines                               -0.000***  -0.000***  -0.000***  -0.000***  -0.000***  
##                                                 (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    
##   StatedMonthlyIncome                                      -0.000     -0.000     -0.000      0.000*    
##                                                            (0.000)    (0.000)    (0.000)    (0.000)    
##   RevolvingCreditBalance                                               0.000      0.000      0.000     
##                                                                       (0.000)    (0.000)    (0.000)    
##   Term                                                                            0.000***   0.000***  
##                                                                                  (0.000)    (0.000)    
##   DebtToIncomeRatio                                                                          0.000     
##                                                                                             (0.000)    
## -------------------------------------------------------------------------------------------------------
##   R-squared                     0.9        0.9        0.9        0.9        0.9        0.9        0.9  
##   adj. R-squared                0.9        0.9        0.9        0.9        0.9        0.9        0.9  
##   sigma                         0.0        0.0        0.0        0.0        0.0        0.0        0.0  
##   F                        841560.2   422292.0   283543.9   212659.1   170129.3   152823.0   120429.8  
##   p                             0.0        0.0        0.0        0.0        0.0        0.0        0.0  
##   Log-likelihood           201226.8   201365.5   201641.2   201641.8   201642.7   204547.9   187932.5  
##   Deviance                     43.3       43.1       42.9       42.9       42.9       40.0       35.7  
##   AIC                     -402447.6  -402723.0  -403272.3  -403271.6  -403271.5  -409079.8  -375847.0  
##   BIC                     -402419.6  -402685.6  -403225.6  -403215.5  -403206.0  -409005.0  -375763.7  
##   N                         84853      84853      84853      84853      84853      84853      77557    
## =======================================================================================================
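
The table above adds one predictor per column. The code that produced it is not shown, so the following is only a minimal sketch of one way to build such nested models with `update()`. The data frame `prosper_demo` and all of its values are simulated stand-ins (an assumption, not the Prosper data), and only the first three models are reproduced.

```r
# Simulated stand-in for the prosper data frame; every value here is made up
# for illustration and is NOT taken from the Prosper dataset.
set.seed(42)
n <- 500
prosper_demo <- data.frame(
  ProsperRating.numeric = sample(1:7, n, replace = TRUE),
  CreditScore.avg       = round(runif(n, 600, 850)),
  OpenCreditLines       = rpois(n, 9)
)
prosper_demo$BorrowerRate <- 0.37 - 0.043 * prosper_demo$ProsperRating.numeric +
  rnorm(n, sd = 0.02)

# Start from the single-predictor model and add one term at a time.
m1 <- lm(BorrowerRate ~ ProsperRating.numeric, data = prosper_demo)
m2 <- update(m1, . ~ . + CreditScore.avg)
m3 <- update(m2, . ~ . + OpenCreditLines)

# Collect R-squared, AIC and sample size for each model in one small table.
models <- list(m1 = m1, m2 = m2, m3 = m3)
fit_summary <- data.frame(
  r.squared = sapply(models, function(m) summary(m)$r.squared),
  AIC       = sapply(models, AIC),
  N         = sapply(models, nobs)
)
print(round(fit_summary, 4))
```

With the real data, the same pattern would continue with StatedMonthlyIncome, RevolvingCreditBalance, Term and DebtToIncomeRatio to reach m7.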

Below is the plot of the results of model m7.

The above graph contains four plots:

Residuals vs Fitted plot - The residuals are spread equally around a near-horizontal line that rises along the x-axis. The residuals do not show non-linear patterns.

Normal Q-Q plot - The residuals are approximately normally distributed, except in the tails.

Scale-Location plot - The spread of the residuals increases slightly along the x-axis, so the red smoothed line is slightly curved.

Residuals vs Leverage - All cases lie inside the Cook’s distance lines. No case falls outside the dashed lines, so no case is influential to the regression result.
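
The four panels described above are the default output of `plot()` on an `lm` object. The sketch below shows how they can be drawn in a 2 x 2 grid, together with a numeric check that mirrors the Residuals vs Leverage panel. The data frame `demo` is simulated for illustration only (an assumption), and the 0.5 cutoff is simply the inner Cook's distance contour that `plot.lm` draws by default.

```r
# Simulated stand-in data; the values are made up and are NOT the Prosper data.
set.seed(1)
n <- 300
demo <- data.frame(ProsperRating.numeric = sample(1:7, n, replace = TRUE))
demo$BorrowerRate <- 0.37 - 0.043 * demo$ProsperRating.numeric +
  rnorm(n, sd = 0.02)

fit <- lm(BorrowerRate ~ ProsperRating.numeric, data = demo)

# Draw the four default diagnostic plots (Residuals vs Fitted, Normal Q-Q,
# Scale-Location, Residuals vs Leverage) in a 2 x 2 grid.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)  # restore the previous graphics settings

# Numeric companion to the leverage panel: Cook's distance per observation.
cooks <- cooks.distance(fit)
influential <- which(cooks > 0.5)  # 0.5 is the inner dashed contour in plot.lm
length(influential)                # number of cases beyond that contour
```

Checking `cooks.distance()` directly is useful with many observations, where individual points are hard to read off the plot.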

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

BorrowerRate, ProsperRating.numeric and CreditScore.avg are the main features that are strongly related to each other. A few other features, such as OpenCreditLines, Term and DebtToIncomeRatio, are weakly correlated with these strong features.

ProsperRating.numeric is affected by CreditScore.avg, StatedMonthlyIncome, InquiriesLast6Months, DebtToIncomeRatio, RevolvingCreditBalance and OpenCreditLines.

Were there any interesting or surprising interactions between features?

No. The variables behaved as I expected.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes. I created a linear model based on the strong and weak features, but the weak features did not have much impact on the Multiple R-squared value. ProsperRating.numeric alone gives a Multiple R-squared of 0.9084.

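Adding CreditScore.avg increased this value to 0.9087, OpenCreditLines increased it to 0.9093, Term increased it to 0.9153 and DebtToIncomeRatio increased it to 0.9158. Note that the regression table reports N = 77557 for m7 against N = 84853 for the other models, so the last increase is measured on a smaller sample and is not strictly comparable.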

In the final model, the Multiple R-squared is 0.9 when rounded to one decimal place, as shown in the table.
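
The regression table reports a smaller N for m7 than for m1 to m6. This is what `lm()` produces when a newly added predictor has missing values, since rows with an `NA` in any model variable are dropped by default. The sketch below illustrates the effect and one way to compare R-squared on a common set of rows. The data frame `demo` is simulated (an assumption), and whether missing DebtToIncomeRatio values explain the drop in the original analysis is not confirmed by the source.

```r
# Simulated stand-in data; the values are made up and are NOT the Prosper data.
set.seed(7)
n <- 400
demo <- data.frame(
  ProsperRating.numeric = sample(1:7, n, replace = TRUE),
  DebtToIncomeRatio     = runif(n, 0, 1)
)
demo$BorrowerRate <- 0.37 - 0.043 * demo$ProsperRating.numeric +
  rnorm(n, sd = 0.02)

# Blank out 40 ratios to mimic missing values in one predictor.
demo$DebtToIncomeRatio[sample(n, 40)] <- NA

# With the default na.action, lm() drops rows that have NA in a model variable,
# so the two models below are fit on different numbers of rows.
small <- lm(BorrowerRate ~ ProsperRating.numeric, data = demo)
big   <- lm(BorrowerRate ~ ProsperRating.numeric + DebtToIncomeRatio, data = demo)
print(c(small = nobs(small), big = nobs(big)))

# Refit the smaller model on the rows the larger model actually used,
# so that the R-squared values refer to the same observations.
common       <- demo[complete.cases(demo), ]
small_common <- lm(BorrowerRate ~ ProsperRating.numeric, data = common)
print(c(small_common = summary(small_common)$r.squared,
        big          = summary(big)$r.squared))
```

Restricting every model to the complete cases before fitting keeps N constant across the table columns.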

The strength of the model is that it includes the most important variable, ProsperRating.numeric, together with the other variables that affect the fit of the linear model.

Its limitation is that the dataset has 81 variables, of which only 20 were selected for this analysis. The variables left out may include some that have a major impact on the result of the model.


Final Plots and Summary

Plot One

Description One

The above graph shows the relationship of Borrower Rate with Average Credit Score and Prosper Rating. There appears to be a downward trend in Borrower Rate as the average credit score increases from 600 to 900. We can also see that Prosper Rating plays an important part in determining the Borrower Rate: for the same credit score, a better Prosper Rating lowers the Borrower Rate.

Plot Two

Description Two

The above graph shows the relationship of Borrower Rate with Prosper Rating and Term. There is a strong negative correlation between Borrower Rate and Prosper Rating: the Borrower Rate decreases as the Prosper Rating increases. Loan Term also plays an important part in determining the Borrower Rate within each Prosper Rating category. A 12-month term has the lowest Borrower Rate in each category compared with the 36-month and 60-month terms, and the Borrower Rate for a 60-month term appears slightly higher than for a 36-month term.

Plot Three

Description Three

The above graph shows the relationship between Loan Amount, Prosper Rating and Borrower Rate. The loan amount increases as the Prosper Rating increases. When the graph is faceted by Borrower Rate, we can see that people with a high Prosper Rating are given larger loans at a low Borrower Rate of 0-10%. Similarly, people with a very low Prosper Rating are given very small loans at a high Borrower Rate of 30-40%. We can therefore infer from the graph that as the Prosper Rating falls, the loan amount also falls while the Borrower Rate goes up.
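
Faceting by Borrower Rate requires turning the continuous rate into bands such as 0-10% and 30-40%. The plotting code is not shown in this section, so the sketch below only illustrates one way to build such bands with `cut()`. The `rates` vector is a handful of made-up example values, and the name `rate_bin` is hypothetical.

```r
# Made-up example rates for illustration only (NOT taken from the Prosper data).
rates <- c(0.06, 0.12, 0.18, 0.27, 0.35, 0.09)

# Bin the rates into 10-percentage-point bands with readable labels.
rate_bin <- cut(rates,
                breaks = seq(0, 0.4, by = 0.1),
                labels = c("0-10%", "10-20%", "20-30%", "30-40%"),
                include.lowest = TRUE)

# Count how many example rates fall into each band.
print(table(rate_bin))
```

A factor built this way can be stored as a column of the data frame and passed to ggplot2's `facet_wrap()` to get one panel per rate band.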


Reflection

The Prosper data has many variables, so for the scope of this project I limited the number of variables to investigate. The first step was to select which variables to investigate. After much thought, I chose the variables that represent personal and credit information about an individual. Lenders would certainly want to know these before providing a loan to a borrower.

I wanted to show the relationship of these variables with Borrower Rate, for instance Debt to Income Ratio vs Borrower Rate, Credit Score vs Borrower Rate, Open Credit Lines vs Borrower Rate, etc.

After graphing most of the selected variables, Prosper Rating appeared to be the most important and influential variable for Borrower Rate; the two are negatively correlated. Credit Score also appeared to influence Borrower Rate above a value of 600, beyond which it is negatively correlated with the rate.

Open Credit Lines, Revolving Balance, Term, Monthly Income, Debt to Income Ratio and Inquiries in the Last 6 Months were the other factors, each of which slightly affects Borrower Rate.

Finally, this analysis considered only a few selected factors. To get the full picture of what influences Borrower Rate and Prosper Rating, the entire dataset of 81 variables would have to be examined. Variables related to Delinquencies, BankCard, ScorexChange, etc. might also have an effect on Borrower Rate or Prosper Rating.